SIGNAL PROCESSING ALGORITHMS FOR HETEROGENEOUS ARCHITECTURES

1 JULY 1987 - 31 OCTOBER 1991
FINAL REPORT


1.A. STATEMENT OF THE PROBLEM STUDIED


As stated in our proposal, our research goal was to "derive architectures which
preserve the VLSI implementation advantages of regular meshes while still obtaining
sufficient increases in performance and/or decreases in size and cost." Our problem domain
was signal and image processing applications which either require significant computational
resources or must be solved in real time. Our research has been successful in that we
have derived a number of such architectures for important problem areas in image and signal
processing. These architectures are outlined in more detail in the next section. We also
successfully prototyped one of our architectures, the Arithmetic Cube, which involved the
design and fabrication of both VLSI chips and printed circuit boards. The Arithmetic Cube
directly computes both the discrete Fourier transform and cyclic convolution. In parallel
with our architecture work, we also investigated algorithms which map efficiently onto the
various derived architectures and which take advantage of their particular architectural
aspects.


1.B. SUMMARY OF THE MOST IMPORTANT RESULTS


The results of the research supported by this three-year ARO grant fall into two primary
areas, architecture and algorithms, and one secondary area, design tools. We address
the research results in each of these areas in turn.

In the architecture area we developed a number of new systolic VLSI architectures for
computationally demanding applications. The first such systolic VLSI architecture can
achieve real-time isolated word recognition for large dictionaries. The design is based on the
dynamic time warping (DTW) algorithm, an exhaustive search technique which permits
nonlinear pattern matching between an unknown utterance and a reference word. Our design
differs from previous systolic DTW designs in that (1) all data are represented in signed-digit,
base 4 format; (2) digits are passed between processing elements in a
most-significant-digit-first, digit-serial fashion; and (3) the algorithms are pipelined at the
digit level. Using most-significant-digit-first data flow allows digit pipelining to succeed
where conventional bit-serial pipelining has failed for the arithmetic operations required in
the DTW algorithm. This allows a very high degree of concurrency and a high data rate to be
maintained, while the pin-out requirements are kept low. The VLSI DTW design is both flexible
and modular. It is independent of the number of coefficients per frame and the precision of
those coefficients. The design is also easily expandable in the number of frames per word and
the warp factor used to achieve the nonlinear matching. Using our dynamic time warp processor
design, real-time word recognition with a 100,000-word vocabulary is possible. Furthermore,
because the processor is area efficient, only a few VLSI chips would be needed.
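The recurrence underlying DTW can be illustrated with a short software sketch. The Python below is a minimal sequential version, not the systolic, digit-serial design described above. The local distance (sum of absolute coefficient differences), the three-predecessor step pattern, and the names `dtw_distance`, `recognize`, and `warp` are assumptions made for illustration.

```python
import math


def dtw_distance(utterance, reference, warp=None):
    """Accumulated DTW distance between two sequences of frames.

    Each frame is a sequence of coefficients. `warp` (assumed meaning)
    limits how far the path may stray from the diagonal: |i - j| <= warp.
    """
    n, m = len(utterance), len(reference)
    # D[i][j] = best accumulated distance aligning the first i and j frames.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if warp is not None and abs(i - j) > warp:
                continue  # outside the allowed warp window
            # Local distance between frames (assumed: city-block metric).
            local = sum(abs(a - b)
                        for a, b in zip(utterance[i - 1], reference[j - 1]))
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def recognize(utterance, dictionary, warp=None):
    """Exhaustive search: return the reference word with the smallest distance."""
    return min(dictionary,
               key=lambda word: dtw_distance(utterance, dictionary[word], warp))
```

Every cell on an anti-diagonal of `D` depends only on cells of the two preceding anti-diagonals, which is the property that makes a wavefront (systolic) evaluation possible. The exhaustive search in `recognize` evaluates one such table per dictionary entry.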

Another of our new systolic VLSI architectures can be used to determine how well
two two-dimensional images match. These space warp arrays are similar to the time warp
array except that instead of matching in one temporal dimension (amplitude vs. time), the
space warper matches in two spatial dimensions (amplitude vs. space x space). In this way the
dynamic space warp VLSI array is the two-dimensional analog of the dynamic time warp
array for one-dimensional signals. Such arrays can form the basis of an image
recognition system. The design of the arrays is based on expanding and contracting
the two images with respect to one another so that a best match is obtained. One
array, the image warp array, expands and contracts the two entire images with respect to one
another. The other array, the line warp array, finds, for each vertical or horizontal line in
one image, the best match for that line in the second image. The distance between the two
images is then the sum of the line distances over all vertical and horizontal lines. An
advantage of the line warp array is that it involves a three-dimensional wavefront array
versus the four-dimensional wavefront array required for the image warp array. We show that
an efficient and robust VLSI design is possible if local feature matching is used. For
example, it would be possible to build a space warper which could match (in the same sense
that two words are matched by a time warper) thirty 1024 by 1024 pictures per second.

Yet another of our new systolic VLSI architectures, which we call the Arithmetic
Cube, directly computes both the discrete Fourier transform and convolution based on the
so-called small n algorithms. Most VLSI architectures proposed for computing these functions
use either a straightforward matrix times vector approach or data shuffling to reduce the
number of arithmetic operations. However, matrix times vector architectures perform many
more operations than are necessary, while data shuffling architectures have VLSI
implementation problems, due in part to the inability to obtain a bounded-length or
nearest-neighbor interconnect structure in VLSI. The Arithmetic Cube performs no more
multiplications than are necessary and has a regular, bounded-length, nearest-neighbor
interconnect structure. Using funds received from the National Science Foundation, a full
prototype of the Arithmetic Cube was constructed. This prototype is presently fully
operational and is installed in a Sun-4 workstation.
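To make the contrast with the matrix times vector approach concrete, the following Python sketch compares a direct cyclic convolution with a 2-point small n algorithm. It is a textbook-style illustration, not the Arithmetic Cube's own algorithm or data flow. The function names are hypothetical, and integer inputs are assumed so that `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def cyclic_conv_direct(x, h):
    """Matrix-times-vector form: n*n multiplications for length n."""
    n = len(x)
    return [sum(x[j] * h[(k - j) % n] for j in range(n)) for k in range(n)]


def cyclic_conv2_small_n(x, h):
    """2-point cyclic convolution using only 2 general multiplications.

    The terms g0 and g1 depend only on the fixed sequence h, so they can be
    precomputed; only the two products m0 and m1 involve the data x.
    """
    g0 = Fraction(h[0] + h[1], 2)
    g1 = Fraction(h[0] - h[1], 2)
    m0 = (x[0] + x[1]) * g0   # multiplication 1
    m1 = (x[0] - x[1]) * g1   # multiplication 2
    return [m0 + m1, m0 - m1]
```

For `x = [3, 5]` and `h = [2, 7]` both functions return `[41, 31]`; the direct form uses four multiplications and the small n form uses two, at the cost of extra additions.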

Also in the architecture area we developed three new systolic, very fine-grained
architectures. To maintain their fine granularity, most fine-grained processors are relatively
inflexible: any attempt to increase flexibility increases processor complexity and thereby
coarsens the grain. The architectures we developed maintain both a high degree of flexibility
and fine granularity. These micro-grained architectures are especially suited for problems
with a high degree of parallelism. In the first of these architectures each processor is an
associative memory word. However, unlike other associative memory processors, ours uses a
two-dimensional interconnect and a physically compact memory word structure, and its
arithmetic operations are based on a redundant number system. These features provide an even
higher level of performance, particularly for certain two-dimensional problems.
For example, we were able to show that an array capable of performing a 3x3 two-dimensional
Laplacian over a 256x256 image in as little as 200 microseconds could be built using as few
as sixteen VLSI chips.
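For reference, the per-pixel computation in that example can be sketched in a few lines of Python. The report does not give the Laplacian coefficients or the border treatment, so the 4-neighbor kernel and the zeroed border used here are assumptions, and `laplacian_3x3` is a hypothetical name.

```python
def laplacian_3x3(image):
    """Apply a 3x3 Laplacian to a 2-D list of numbers.

    Assumed kernel:   0  1  0
                      1 -4  1
                      0  1  0
    Border pixels are left at zero (assumed border treatment).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Each output pixel needs only its four nearest neighbors,
            # so all pixels can be computed concurrently.
            out[r][c] = (image[r - 1][c] + image[r + 1][c]
                         + image[r][c - 1] + image[r][c + 1]
                         - 4 * image[r][c])
    return out
```

Because every output depends only on nearest neighbors, the operation maps naturally onto a two-dimensional interconnect with one pixel per processor.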

The other two micro-grained architectures were inspired by our interest in deriving
architectures suitable for implementation by programmable logic devices (PLDs). Like PLDs,
the processors (cells) of our architectures consist of a few small RAMs and a local
interconnect structure. Unlike PLDs, our architectures use a less general global interconnect
structure. However, what we lose in generality we make up for in speed, especially for the
problems of interest to us. These two architectures differ mostly in how the local function
is computed. One architecture uses RAMs, like some varieties of PLDs; the other uses
multiplexers, like other varieties of PLDs. We have shown (via fabrication through MOSIS)
that a VLSI chip containing over 4096 processors can be built.
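The idea of computing a cell's local function from a small RAM can be illustrated as follows. This is a generic lookup-table sketch in Python, not a model of the fabricated cells; the cell interface, the table sizes, and the names `make_lut_cell`, `sum_cell`, and `carry_cell` are assumptions.

```python
def make_lut_cell(truth_table):
    """Return a cell whose local function is stored in a small 'RAM'.

    The input bits form the RAM address (first argument is the most
    significant address bit); the stored word is the cell's output.
    """
    ram = list(truth_table)

    def cell(*bits):
        address = 0
        for b in bits:
            address = (address << 1) | (b & 1)
        return ram[address]

    return cell


# Two 8-entry RAMs programmed as the sum and carry outputs of a full adder.
BITS = (0, 1)
sum_cell = make_lut_cell(a ^ b ^ c for a in BITS for b in BITS for c in BITS)
carry_cell = make_lut_cell((a & b) | (a & c) | (b & c)
                           for a in BITS for b in BITS for c in BITS)
```

Reprogramming the RAM contents changes the function without changing the cell, which is the source of the flexibility; a multiplexer-based cell achieves the same effect by selecting among stored constants with the input bits.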

Another example of our work in the area of architectures for computationally demanding
applications is a simple memory array architecture we call the Access Constrained
Memory Array Architecture (ACMAA). The architecture consists of a linear array of
processors concurrently accessing rows or columns of an array of memory modules. This
organization has a simple and regular structure and hence is very efficient for VLSI
implementation. Parallel algorithms for various graph problems (bridge detection, transitive
closure, shortest path), for image enhancement (contrast enhancement and edge detection), and
for region (connected component) labeling have been developed which take advantage of the
structure of the memory architecture.
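As an illustration of why row-oriented access suits such problems, the Python sketch below computes a transitive closure with Warshall's algorithm expressed as whole-row operations. It is a standard sequential formulation, not the report's ACMAA algorithm; the bitmask row encoding and the name `transitive_closure` are assumptions.

```python
def transitive_closure(adj):
    """Warshall's algorithm on an n x n 0/1 adjacency matrix.

    Each row is packed into an integer bitmask (bit j set means an edge to j),
    so one row update is a single OR of two rows.
    """
    n = len(adj)
    rows = [sum(1 << j for j in range(n) if adj[i][j]) for i in range(n)]
    for k in range(n):
        row_k = rows[k]
        # For a fixed k the updates of different rows are independent,
        # so a linear array of processors could handle one row each.
        for i in range(n):
            if (rows[i] >> k) & 1:
                rows[i] |= row_k
    return [[(rows[i] >> j) & 1 for j in range(n)] for i in range(n)]
```

For the chain 0 -> 1 -> 2 the closure adds the edge 0 -> 2. In each step every processor needs only its own row and the shared row `k`, an access pattern that matches concurrent row access to a memory array.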

We have also developed several architectures for signal processing problems which
require designs with feedback paths. Such architectures are often limited in performance
because of the time required to wait for the feedback data. In particular, we developed an
architecture for IIR filters based on a new algorithm. We have also found that digit-serial
architectures, with both high throughput and low system latency, can be used very effectively
in designs with feedback paths. Such architectures can exploit digit pipelining, that is,
overlapping the input with the computation. Our work has concentrated on the design of
various IIR filters based on digit-serial components.
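The feedback dependency in question can be seen in the simplest recursive filter. The Python sketch below is a first-order IIR filter in direct form, given only to show the data dependency; it is not the new IIR algorithm or the digit-serial design mentioned above, and the coefficient names `a` and `b` are assumptions.

```python
def iir_first_order(x, a, b):
    """Compute y[n] = b * x[n] + a * y[n-1] with y[-1] = 0."""
    y = []
    previous = 0.0
    for sample in x:
        # The new output cannot be started until the previous output is
        # available: this is the feedback path that limits throughput.
        previous = b * sample + a * previous
        y.append(previous)
    return y
```

With `a = 0.5` and `b = 1.0`, a unit impulse `[1, 0, 0, 0]` produces `[1.0, 0.5, 0.25, 0.125]`. In hardware, if the multiply-add must finish completely before the next sample can use its result, the sample period is bounded by that latency. A most-significant-digit-first design can begin the next computation as soon as the leading digits of the previous result emerge.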

Finally, in the architecture area we have investigated a number of architectures which
can perform matrix transpositions for two and higher dimensions. Efficient solutions to this
problem would be of use in many signal processing architectures. Currently, we are
examining two approaches. The first is a conceptual approach applicable to matrices of any
dimension. In this approach, matrix elements are placed in equivalence classes under the
operation of cyclic index shift. The transpose operation is effected by permuting elements
within each equivalence class. We are studying methods for performing the permutations. The
second approach is systolic in nature and is applicable to two-dimensional matrices. Each
cell connects to three neighbors to the left and three to the right. The cell consists of a
three-by-three crossbar switch and some simple control for routing the inputs from the left
to the outputs on the right. The routing algorithm is completely local in nature.
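One possible reading of the first approach is sketched below in Python. It assumes a d-dimensional matrix with n entries along every dimension, and it takes the generalized transpose to move the element at index (i1, i2, ..., id) to index (i2, ..., id, i1). Both of these, along with the dictionary storage and the function names, are assumptions for illustration rather than details taken from the report.

```python
from itertools import product


def shift(index):
    """Cyclic index shift: (i1, i2, ..., id) -> (i2, ..., id, i1)."""
    return index[1:] + index[:1]


def equivalence_class(index):
    """All indices reachable from `index` by repeated cyclic shifts."""
    members = [index]
    nxt = shift(index)
    while nxt != index:
        members.append(nxt)
        nxt = shift(nxt)
    return members


def transpose_by_cyclic_shift(data, n, d):
    """Transpose in place; `data` maps index tuples to values."""
    seen = set()
    for index in product(range(n), repeat=d):
        if index in seen:
            continue
        members = equivalence_class(index)
        seen.update(members)
        values = [data[m] for m in members]
        # Permute within the class: each element moves to its shifted index.
        for member, value in zip(members, values):
            data[shift(member)] = value
    return data
```

For d = 2 each class is a pair {(i, j), (j, i)} or a single diagonal element, and the permutation is the ordinary matrix transpose. No element ever leaves its class, so the classes can be processed independently.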

Next  we  outline  our  results  in  the  algorithm  area.  Our  algorithm  work  has,  to  some 
degree,  paralleled  our  architecture  work. 


In the area of image recognition we have developed algorithms based on the concepts
of local-distance diagrams, dynamic programming, and the minimum principle. The first
algorithm, the dynamic space-warping algorithm (DSWA), is used to find the minimum
distance between two areas. The second algorithm, the dynamic line-warping algorithm
(DLWA), is used to find the minimum distance between a line and an area. For each of these
image recognition algorithms, we need to consider both recognition rates and computational
performance. We have shown experimentally that the DSWA has good recognition
performance. However, more experiments need to be done to verify that the DLWA has
satisfactory recognition rates. Intuitively, the DSWA can be implemented with a
four-dimensional wavefront architecture and the DLWA with a three-dimensional wavefront
architecture.


We developed several new families of algorithms for the Arithmetic Cube II. As it
stands, the Arithmetic Cube II is "optimized" for computing the small n DFT and convolution
algorithms for which n lies in a small range. Algorithms for which n is too large simply
cannot be computed. Algorithms for which n is too small underutilize the hardware on the
Cube. To alleviate this situation, we have shown how changing the dimensionality of a
transform can be used to efficiently compute an arbitrary sized problem on an Arithmetic
Cube of given size. For example, we showed how either the discrete Fourier transform of n
points or the cyclic convolution of two length n sequences can be performed on an Arithmetic
Cube whose area is A in time $O\left(\frac{n}{\sqrt{A}} \log\frac{n}{\sqrt{A}} + \sqrt{n}\right)$.
This time bound is within a $\log\frac{n}{\sqrt{A}}$ factor of the lower bound of
$\Omega\left(\frac{n}{\sqrt{A}} + \sqrt{n}\right)$. Note that even if n and A differ by
several orders of magnitude, this factor is small. Furthermore, even for a small Arithmetic
Cube, our bound is no worse than performing the fast Fourier transform. We have also
developed several new small n algorithms for the Cube which make better use of its
architecture.

We developed a new algorithm for IIR filters. This algorithm obtains a high degree of
parallelism, lower computational complexity than the block methods, and higher stability than
the back substitution methods.

We have either modified existing algorithms or developed new algorithms for our
micro-grained and Access Constrained Memory Array Architectures. These include
various graph algorithms (bridge detection, transitive closure, shortest path), image
enhancement algorithms (contrast enhancement and edge detection), and region (connected
component) labeling algorithms. Arithmetic algorithms for the elementary functions (addition,
multiplication, division, trigonometric functions, etc.) have also been developed which are
especially suited for our micro-grained architectures.

While our architecture and algorithm work was (and still is) the driving force of our
research, we have also developed and/or acquired the CAD tools needed to quickly
prototype our architectures. In particular, we have developed CAD tools which
allow us to directly synthesize complete processors. Our processor synthesizer takes as input
a hierarchical description of the computation to be performed and some hint as to how it is to
be performed, and outputs a detailed report on the hardware requirements for the specified
computation, including the structure of each compute node, how the nodes are connected, and
the input and output sequences. Our module generator takes as input a hierarchical netlist
description of a module and outputs a CMOS VLSI layout of the module.
DEPARTMENT OF THE ARMY
January 25, 1993
possible patent application.

This office offers no objection to the inventor seeking patent
for the subject invention provided:
REPLY TO ATTENTION OF:


Procurement Office P-20090-EL
Invention Disclosures: PSU 86-751, 86-752 and 86-753
Director
By your letter of September 11, 1987 you disclosed inventions
UNIVERSITY PARK, PA 16802

Forrest J. Remick

Associate Vice President for Research November 19, 1987
814-865-6331


Professor  Robert  M.  Owens 
College  of  Science 
The  Pennsylvania  State  University 
308  Whitmore  Laboratory 
University  Park,  PA  16802 

Professor  Mary  Jane  Irwin 
College  of  Science 
The  Pennsylvania  State  University 
305  Whitmore  Laboratory 
University  Park,  PA  16802 

Re:  "An  Area  Efficient  VLSI  FIR  Filter," 

by  R.  M.  Owens  and  M.  J.  Irwin 
Our  Ref:  86-751 

Professors:

In response to your request dated July 14, 1987, having reviewed
the various factual elements requisite to release, the University agrees
to release to you the invention referenced above on the condition you
agree:

1.  Not  to  use  the  University  or  the  University's  name  in  the 
exploitation  of  such  invention. 

2.  That  the  University  may  retain  a  license  for  University 
3. That you will convey to the University or any sponsor such
rights  as  are  necessary  to  fulfill  the  obligations  of  the 
University  to  other  parties,  specifically,  that  you  will 
comply  with  the  federal  government  requirements  set  forth  in 
the  attached  letter. 


Please  indicate  your  agreement  and  acceptance  of  this  condition  by  adding 
your  signatures  and  the  dates  thereof  in  the  spaces  provided  below  before 
returning  this  letter  to  me. 